{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "13781ca2-6df6-476b-bbf0-db65fd57857f",
   "metadata": {},
   "source": [
    "# L2: Filtering With Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "630f5433-d0aa-48c9-bd58-5916c14099b0",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note <code>(Kernel Starting)</code>:</b> This notebook takes about 30 seconds to be ready to use. You may start and watch the video while you wait.</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb2395da-11f5-4577-9dc6-e6660129ff41",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "# Warning control\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cd7993c-fdee-407d-b485-e0c4779e5038",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "import custom_utils"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03466e3c-95d9-4d1a-8633-9be7084cfce9",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px\"> 💻 &nbsp; <b>Access <code>requirements.txt</code> and <code>utils</code> files:</b> To access <code>requirements.txt</code> for this notebook, 1) click on the <em>\"File\"</em> option on the top menu of the notebook and then 2) click on <em>\"Open\"</em>. For more help, please see the <em>\"Appendix - Tips and Help\"</em> Lesson.</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e70b37d-e581-40f2-a97e-c75487a51082",
   "metadata": {},
   "source": [
    "## Data Loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90830490-dfcb-4b37-bf85-911908a38927",
   "metadata": {
    "height": 183
   },
   "outputs": [],
   "source": [
    "# 1. Dataset Loading\n",
    "from datasets import load_dataset\n",
    "import pandas as pd\n",
    "\n",
    "dataset = load_dataset(\"MongoDB/airbnb_embeddings\", streaming=True, split=\"train\")\n",
    "dataset = dataset.take(100)\n",
    "# Convert the dataset to a pandas dataframe\n",
    "dataset_df = pd.DataFrame(dataset)\n",
    "dataset_df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf0ddaa2-3632-4fe6-8e0c-8713d655f4d9",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "print(\"Columns:\", dataset_df.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "264a8bb2-8c6e-4c14-a403-92f0bb2998b2",
   "metadata": {},
   "source": [
    "## Document Modelling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdafcd78-aa97-4376-a0d2-ca27ed34e08b",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "listings = custom_utils.process_records(dataset_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd3e25fa-37ec-46cd-8444-eded661c2613",
   "metadata": {},
   "source": [
    "## Database Creation and Connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff2f3858-21c2-4946-80d7-fd7f86a53298",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "db, collection = custom_utils.connect_to_database()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9db8cbd-cb43-4988-aca1-dbb6219bf425",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "# Delete any existing records in the collection\n",
    "collection.delete_many({})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "322d3a43-7643-47c2-9fbe-5fc12150d4c1",
   "metadata": {},
   "source": [
    "## Data Ingestion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a31b0fa9-d671-4404-8e31-4088f50b5771",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "# The ingestion process might take a few minutes\n",
    "collection.insert_many(listings)\n",
    "print(\"Data ingestion into MongoDB completed\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8519448f-3edc-46f6-9fb1-98dd7810efa9",
   "metadata": {},
   "source": [
    "## Vector Search Index defintion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5aa8a47d-e962-4d72-b444-3e6718a9b666",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "custom_utils.setup_vector_search_index(collection=collection)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49b7c725-831e-41a1-a534-2293cf0769b9",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note:</b> If the output of the previous cell is <code>Error creating vector search index: Duplicate Index</code> you may proceed to the next cell if you intend to still use a previously created index.</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fe10702-8b54-491d-abaa-6ca6ba1a78f3",
   "metadata": {},
   "source": [
    "## Compose Vector Search Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3612890d-8fd9-4c1c-a16a-9547eaefb10e",
   "metadata": {
    "height": 897
   },
   "outputs": [],
   "source": [
    "def vector_search(user_query, db, collection, additional_stages=[], vector_index=\"vector_index_text\"):\n",
    "    \"\"\"\n",
    "    Perform a vector search in the MongoDB collection based on the user query.\n",
    "\n",
    "    Args:\n",
    "    user_query (str): The user's query string.\n",
    "    db (MongoClient.database): The database object.\n",
    "    collection (MongoCollection): The MongoDB collection to search.\n",
    "    additional_stages (list): Additional aggregation stages to include in the pipeline.\n",
    "\n",
    "    Returns:\n",
    "    list: A list of matching documents.\n",
    "    \"\"\"\n",
    "\n",
    "    # Generate embedding for the user query\n",
    "    query_embedding = custom_utils.get_embedding(user_query)\n",
    "\n",
    "    if query_embedding is None:\n",
    "        return \"Invalid query or embedding generation failed.\"\n",
    "\n",
    "    # Define the vector search stage\n",
    "    vector_search_stage = {\n",
    "        \"$vectorSearch\": {\n",
    "            \"index\": vector_index, # specifies the index to use for the search\n",
    "            \"queryVector\": query_embedding, # the vector representing the query\n",
    "            \"path\": \"text_embeddings\", # field in the documents containing the vectors to search against\n",
    "            \"numCandidates\": 150, # number of candidate matches to consider\n",
    "            \"limit\": 20, # return top 20 matches\n",
    "        }\n",
    "    }\n",
    "\n",
    "    # Define the aggregate pipeline with the vector search stage and additional stages\n",
    "    pipeline = [vector_search_stage] + additional_stages\n",
    "\n",
    "    # Execute the search\n",
    "    results = collection.aggregate(pipeline)\n",
    "\n",
    "    explain_query_execution = db.command( # sends a database command directly to the MongoDB server\n",
    "        'explain', { # return information about how MongoDB executes a query or command without actually running it\n",
    "            'aggregate': collection.name, # specifies the name of the collection on which the aggregation is performed\n",
    "            'pipeline': pipeline, # the aggregation pipeline to analyze\n",
    "            'cursor': {} # indicates that default cursor behavior should be used\n",
    "        }, \n",
    "        verbosity='executionStats') # detailed statistics about the execution of each stage of the aggregation pipeline\n",
    "\n",
    "    vector_search_explain = explain_query_execution['stages'][0]['$vectorSearch']\n",
    "    millis_elapsed = vector_search_explain['explain']['collectStats']['millisElapsed']\n",
    "\n",
    "    print(f\"Total time for the execution to complete on the database server: {millis_elapsed} milliseconds\")\n",
    "\n",
    "    return list(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3423fc28-ee46-4c66-8e35-c757b9d3e4d6",
   "metadata": {},
   "source": [
    "## Handling User Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "523d5be8-f399-4f73-b7f1-0f3b1f694f24",
   "metadata": {
    "height": 166
   },
   "outputs": [],
   "source": [
    "from pydantic import BaseModel\n",
    "from typing import Optional\n",
    "\n",
    "class SearchResultItem(BaseModel):\n",
    "    name: str\n",
    "    accommodates: Optional[int] = None\n",
    "    bedrooms: Optional[int] = None\n",
    "    address: custom_utils.Address\n",
    "    space: str = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33e7149f-8792-4e92-96c1-e5c06a1a59f9",
   "metadata": {
    "height": 778
   },
   "outputs": [],
   "source": [
    "from IPython.display import display, HTML\n",
    "\n",
    "def handle_user_query(query, db, collection, stages=[], vector_index=\"vector_index_text\"):\n",
    "    # Assuming vector_search returns a list of dictionaries with keys 'title' and 'plot'\n",
    "    get_knowledge = vector_search(query, db, collection, stages, vector_index)\n",
    "\n",
    "    # Check if there are any results\n",
    "    if not get_knowledge:\n",
    "        return \"No results found.\", \"No source information available.\"\n",
    "\n",
    "    # Convert search results into a list of SearchResultItem models\n",
    "    search_results_models = [\n",
    "        SearchResultItem(**result)\n",
    "        for result in get_knowledge\n",
    "    ]\n",
    "\n",
    "    # Convert search results into a DataFrame for better rendering in Jupyter\n",
    "    search_results_df = pd.DataFrame([item.dict() for item in search_results_models])\n",
    "\n",
    "    # Generate system response using OpenAI's completion\n",
    "    completion = custom_utils.openai.chat.completions.create(\n",
    "        model=\"gpt-3.5-turbo\",\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"system\", \n",
    "                \"content\": \"You are a airbnb listing recommendation system.\"},\n",
    "            {\n",
    "                \"role\": \"user\", \n",
    "                \"content\": f\"Answer this user query: {query} with the following context:\\n{search_results_df}\"\n",
    "            }\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    system_response = completion.choices[0].message.content\n",
    "\n",
    "    # Print User Question, System Response, and Source Information\n",
    "    print(f\"- User Question:\\n{query}\\n\")\n",
    "    print(f\"- System Response:\\n{system_response}\\n\")\n",
    "\n",
    "    # Display the DataFrame as an HTML table\n",
    "    display(HTML(search_results_df.to_html()))\n",
    "\n",
    "    # Return structured response and source info as a string\n",
    "    return system_response"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ad511fb-dbfc-42af-9d4f-9ec5051ec0b4",
   "metadata": {},
   "source": [
    "## Adding A Post Filter to Vector Search (Match Operator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf8afe7b-6f03-41c9-a76b-10eb22289b07",
   "metadata": {
    "height": 251
   },
   "outputs": [],
   "source": [
    "import re\n",
    "# Specifying the metadata field to limit documents on\n",
    "search_path = \"address.country\"\n",
    "\n",
    "# Create a match stage\n",
    "match_stage = {\n",
    "    \"$match\": {\n",
    "       search_path: re.compile(r\"United States\"),\n",
    "       \"accommodates\": { \"$gt\": 1, \"$lt\": 5}\n",
    "    }\n",
    "}\n",
    "\n",
    "additional_stages = [match_stage]"
   ]
  },
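  {
   "cell_type": "markdown",
   "id": "4b9d2e61-7a35-4c08-9f1e-2d6c8a5b3e70",
   "metadata": {},
   "source": [
    "The match stage above hard-codes the country and the guest range. As a minimal sketch, a small helper could build a stage of the same shape from parameters. `build_match_stage`, its argument names, and `example_stage` are hypothetical and not part of `custom_utils`; `re.escape` is used so that any special characters in the country text are matched literally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5e8f1a2-3d47-4b96-8a0c-6f2e9d1b7c34",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def build_match_stage(country, min_guests, max_guests, path=\"address.country\"):\n",
    "    # Hypothetical helper: builds a $match stage shaped like the one above\n",
    "    return {\n",
    "        \"$match\": {\n",
    "            path: re.compile(re.escape(country)),  # re.escape makes the country text match literally\n",
    "            \"accommodates\": {\"$gt\": min_guests, \"$lt\": max_guests},\n",
    "        }\n",
    "    }\n",
    "\n",
    "example_stage = build_match_stage(\"United States\", 1, 5)\n",
    "print(example_stage)"
   ]
  },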
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0587bca1-64cd-46be-af22-4d7fafe7fb94",
   "metadata": {
    "height": 132
   },
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "I want to stay in a place that's warm and friendly, \n",
    "and not too far from resturants, can you recommend a place? \n",
    "Include a reason as to why you've chosen your selection\"\n",
    "\"\"\"\n",
    "handle_user_query(query, db, collection, additional_stages)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "105c41f5-e14e-4254-a10c-df65207128fd",
   "metadata": {},
   "source": [
    "## Adding A PreFilter to Vector Search"
   ]
  },
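  {
   "cell_type": "markdown",
   "id": "7c1e5a90-2b4d-4f6e-9a3b-0d8e6f1c2a11",
   "metadata": {},
   "source": [
    "In the previous section, `$match` runs after `$vectorSearch` has already cut the results down to `limit`, so fewer than `limit` listings can survive the filter. A pre-filter applies the metadata condition before the top matches are selected. The cell below is a minimal, self-contained sketch of that difference on made-up data. The `toy_listings` records, their `score` values, and the helper functions are illustrative assumptions and do not call MongoDB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2a7d4c8-91b3-4f5a-b6d0-3c8f7e1a9b52",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration only: made-up listings with made-up similarity scores (no MongoDB involved)\n",
    "toy_listings = [\n",
    "    {\"name\": \"A\", \"score\": 0.95, \"accommodates\": 1},\n",
    "    {\"name\": \"B\", \"score\": 0.90, \"accommodates\": 8},\n",
    "    {\"name\": \"C\", \"score\": 0.85, \"accommodates\": 2},\n",
    "    {\"name\": \"D\", \"score\": 0.80, \"accommodates\": 4},\n",
    "    {\"name\": \"E\", \"score\": 0.75, \"accommodates\": 3},\n",
    "]\n",
    "\n",
    "def keep(listing):\n",
    "    # Metadata condition: more than 1 and fewer than 5 guests\n",
    "    return 1 < listing[\"accommodates\"] < 5\n",
    "\n",
    "def post_filter(listings, limit):\n",
    "    # Take the top `limit` listings by score first, then apply the filter\n",
    "    top = sorted(listings, key=lambda item: item[\"score\"], reverse=True)[:limit]\n",
    "    return [item[\"name\"] for item in top if keep(item)]\n",
    "\n",
    "def pre_filter(listings, limit):\n",
    "    # Apply the filter first, then take the top `limit` listings by score\n",
    "    eligible = [item for item in listings if keep(item)]\n",
    "    top = sorted(eligible, key=lambda item: item[\"score\"], reverse=True)[:limit]\n",
    "    return [item[\"name\"] for item in top]\n",
    "\n",
    "print(\"Post-filter:\", post_filter(toy_listings, limit=3))\n",
    "print(\"Pre-filter: \", pre_filter(toy_listings, limit=3))"
   ]
  },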
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6982df29-cc4f-4c17-88e9-97dde04a66e7",
   "metadata": {
    "height": 625
   },
   "outputs": [],
   "source": [
    "from pymongo.operations import SearchIndexModel\n",
    "import time \n",
    "\n",
    "vector_index_with_filter = \"vector_index_with_filter\"\n",
    "\n",
    "new_vector_search_index_model = SearchIndexModel(\n",
    "    definition={\n",
    "        \"mappings\": {\n",
    "            \"dynamic\": True,\n",
    "            \"fields\": {\n",
    "                \"text_embeddings\": {\n",
    "                    \"dimensions\": 1536,\n",
    "                    \"similarity\": \"cosine\",\n",
    "                    \"type\": \"knnVector\",\n",
    "                },\n",
    "                 \"accommodates\": {\n",
    "                    \"type\": \"number\"\n",
    "                },\n",
    "                \"bedrooms\": {\n",
    "                    \"type\": \"number\"\n",
    "                },\n",
    "            },\n",
    "        }\n",
    "    },\n",
    "    name=vector_index_with_filter,\n",
    ")\n",
    "\n",
    "# Create the new index\n",
    "try:\n",
    "    result = collection.create_search_index(model=new_vector_search_index_model)\n",
    "    print(\"Creating index...\")\n",
    "    time.sleep(20)  # Sleep for 20 seconds, adding sleep to ensure vector index has compeleted inital sync before utilization\n",
    "    print(\"New index created successfully:\", result)\n",
    "except Exception as e:\n",
    "    print(f\"Error creating new vector search index: {str(e)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1db78472-91f8-4195-bbee-a04f79ff56cd",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note:</b> If the output of the previous cell is <code>Error creating vector search index: Duplicate Index</code> you may proceed to the next cell if you intend to still use a previously created index.</p>"
   ]
  },
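  {
   "cell_type": "markdown",
   "id": "9f3b6c1d-5e82-4a7b-8c49-1a0d7e2f6b95",
   "metadata": {},
   "source": [
    "The index-creation cell waits a fixed 20 seconds, which is only a guess at how long the initial sync takes. A possible alternative, sketched below, polls `collection.list_search_indexes()` until the index reports itself as queryable. This assumes a PyMongo version that provides `list_search_indexes` and an Atlas deployment that returns a `queryable` field for each index. `wait_until_queryable`, `timeout_s`, and `poll_s` are hypothetical names, and the usage line is left commented out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d4e7a2b-6c19-4f83-a5e7-8b1c3f9d2e60",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "def wait_until_queryable(collection, index_name, timeout_s=120, poll_s=5):\n",
    "    # Poll the search index until it is reported as queryable, or give up after timeout_s seconds\n",
    "    deadline = time.monotonic() + timeout_s\n",
    "    while time.monotonic() < deadline:\n",
    "        indexes = list(collection.list_search_indexes(index_name))\n",
    "        if indexes and indexes[0].get(\"queryable\") is True:\n",
    "            return True\n",
    "        time.sleep(poll_s)\n",
    "    return False\n",
    "\n",
    "# Example usage (assumes `collection` and `vector_index_with_filter` from the cells above):\n",
    "# wait_until_queryable(collection, vector_index_with_filter)"
   ]
  },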
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18571849-7fc2-4dfd-889d-5fc7eee0169e",
   "metadata": {
    "height": 625
   },
   "outputs": [],
   "source": [
    "def vector_search(user_query, db, collection, additional_stages=[], vector_index=\"vector_index_text\"):\n",
    "    query_embedding = custom_utils.get_embedding(user_query)\n",
    "    if query_embedding is None:\n",
    "        return \"Invalid query or embedding generation failed.\"\n",
    "\n",
    "    vector_search_stage = {\n",
    "        \"$vectorSearch\": {\n",
    "            \"index\": vector_index,  # specifies the index to use for the search\n",
    "            \"queryVector\": query_embedding,  # the vector representing the query\n",
    "            \"path\": \"text_embeddings\",  # field in the documents containing the vectors to search against\n",
    "            \"numCandidates\": 150,  # number of candidate matches to consider\n",
    "            \"limit\": 20,  # return top 20 matches\n",
    "            \"filter\": {\n",
    "                \"$and\": [\n",
    "                    {\"accommodates\": {\"$gte\": 2}}, \n",
    "                    {\"bedrooms\": {\"$lte\": 7}}\n",
    "                ]\n",
    "            },\n",
    "        }\n",
    "    }\n",
    "    pipeline = [vector_search_stage] + additional_stages\n",
    "    results = collection.aggregate(pipeline)\n",
    "    explain_query_execution = db.command( # sends a database command directly to the MongoDB server\n",
    "        'explain', { # return information about how MongoDB executes a query or command without actually running it\n",
    "            'aggregate': collection.name, # specifies the name of the collection on which the aggregation is performed\n",
    "            'pipeline': pipeline, # the aggregation pipeline to analyze\n",
    "            'cursor': {} # indicates that default cursor behavior should be used\n",
    "        }, \n",
    "        verbosity='executionStats') # detailed statistics about the execution of each stage of the aggregation pipeline\n",
    "\n",
    "    vector_search_explain = explain_query_execution['stages'][0]['$vectorSearch']\n",
    "    millis_elapsed = vector_search_explain['explain']['collectStats']['millisElapsed']\n",
    "\n",
    "    print(f\"Total time for the execution to complete on the database server: {millis_elapsed} milliseconds\")\n",
    "    return list(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a30a104e-344c-4fe9-9804-b418247206a6",
   "metadata": {
    "height": 217
   },
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "I want to stay in a place that's warm and friendly, \n",
    "and not too far from resturants, can you recommend a place? \n",
    "Include a reason as to why you've chosen your selection\"\n",
    "\"\"\"\n",
    "handle_user_query(\n",
    "    query, \n",
    "    db, \n",
    "    collection, \n",
    "    vector_index=vector_index_with_filter\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
